Patrolling in a Stochastic Environment

Sui  Ruan^,  Candra  Meirina^,  Feili  Yu^,  Krishna  Pattipati^  and  Robert  L.  Popp  ^ 


Abstract 

The patrolling problem considered in this paper has the following characteristics: patrol units conduct preventive
patrolling and respond to calls for service. The patrol locations (nodes) have different priorities and varying incident
rates. We design a patrolling scheme such that the locations are visited based on their importance and incident
rates. The solution is accomplished in two steps. First, we partition the set of nodes of interest into subsets of
nodes, called sectors. Each sector is assigned to one patrol unit. Second, for each sector, we exploit a response
strategy of preemptive call-for-service response, and design multiple sub-optimal off-line patrol routes. The net
effect of randomized patrol routes with immediate call-for-service response is to allow the limited patrol resources
to provide prompt response to random requests, while effectively covering the nodes of different priorities having
varying incident rates. To obtain multiple routes, we design a novel learning algorithm (Similar State Estimate
Update) under a Markov Decision Process (MDP) framework, and apply the softmax action selection method. The
resulting patrol routes and patrol unit visibility would appear unpredictable to insurgents and criminals, thus
creating the impression of virtual police presence and potentially mitigating large-scale incidents.


I.  Introduction 

In a highly dynamic and volatile environment, such as a post-conflict stability operation or a troubled
neighborhood, military and/or police units conduct surveillance via preventive patrolling, together with other
peacekeeping or crime prevention activities. Preventive patrol consists of touring an area, with the patrol units
scanning for threats, attempting to prevent incidents, and intercepting any threats in progress. Effective patrolling
can prevent small-scale events from cascading into large-scale incidents, and can enhance civilian security.
Consequently, it is a major component of stability operations and crime prevention. In crime control, for example,
deterrence through ever-present police patrol, coupled with the prospect of speedy police action once a report is
received, appears crucial for the greatest number of civilians, in that the presence or potential presence of police
officers on patrol severely inhibits criminal activity [1]. Due to limited patrolling resources (e.g., manpower,
vehicles, sensing and shaping resources), optimal resource allocation and planning of patrol effort are critical to
effective stability operations and crime prevention [2].

The paper is organized as follows: In Section II, the stochastic patrolling problem is modeled. In Section III,
we propose a solution approach based on an MDP framework. Simulation results are presented in Section IV. In
Section V, the paper concludes with a summary and future research directions.

II.  Stochastic  Patrolling  Model 

The  patrolling  problem  is  modeled  as  follows: 

• A finite set of nodes of interest: $K = \{i;\ i = 1, \ldots, I\}$. Each node $i \in K$ has the following attributes:

Electrical and Computer Engineering Department, University of Connecticut, Storrs, CT 06269-1157, USA. E-mail: [sruan, meirina,
yu02001, krishna]@engr.uconn.edu. Information Exploitation Office, DARPA, 3701 N. Fairfax Drive, Arlington, VA 22203, USA. E-mail:
rpopp@darpa.mil. This work is supported by the Office of Naval Research under contract No. 00014-00-1-0101.
- fixed location $(x_i, y_i)$;

- incident rate $\lambda_i$ (1/hour): we assume that the number of incidents occurring at node $i$ in a time interval
$(t_1, t_2)$, denoted by $n_i(t_2, t_1)$, is a Poisson random variable with parameter $\lambda_i (t_2 - t_1)$:

$$P(n_i(t_2, t_1) = k) = \frac{e^{-\lambda_i (t_2 - t_1)} \, (\lambda_i (t_2 - t_1))^k}{k!}$$


- importance index $\delta_i$: a value indicating the relative importance of node $i$ in the patrolling area.

• The connectivity of the nodes: for any node $j$ directly connected to node $i$, we write $j \in adj(i)$; the
edge connecting them has a known length.

• A finite set of identical patrol units, each with average speed $v$, i.e., the estimated time $t$ for a unit to cover
a distance $d$ is $t = d/v$. Each unit would respond to a call-for-service immediately when a request is received;
otherwise, the patrol unit traverses along prescribed routes.
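The incident and travel-time assumptions in the list above can be illustrated with a short numerical sketch. It is an illustration only: the function names `poisson_pmf` and `travel_time` and the sample values are assumptions made here, not part of the paper.

```python
import math


def poisson_pmf(k, rate, t1, t2):
    """Probability of exactly k incidents at a node in the interval (t1, t2).

    `rate` plays the role of lambda_i (incidents per hour); the Poisson
    parameter is rate * (t2 - t1), as in the model above.
    """
    mean = rate * (t2 - t1)
    return math.exp(-mean) * mean ** k / math.factorial(k)


def travel_time(distance, speed):
    """Estimated time t = d / v for a unit with average speed v."""
    return distance / speed


# Hypothetical node: 2 incidents/hour, left unvisited for 1.5 hours.
probabilities = [poisson_pmf(k, 2.0, 0.0, 1.5) for k in range(10)]
expected_incidents = 2.0 * (1.5 - 0.0)  # mean of the Poisson distribution
```

The expected number of incidents grows linearly with the time since the last visit, which is what makes the elapsed-time vector a natural part of the state in the MDP formulation that follows.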

In  this  paper,  we  focus  our  attention  on  the  problem  of  routing  for  effective  patrolling,  and  assume  that 
whenever  a  patrol  unit  visits  a  node,  the  unit  can  clear  all  incidents  on  that  node  immediately.  Some  real  world 
constraints,  such  as  the  resources  required  and  incident  clearing  times  are  not  considered;  future  work  would 
address  these  extensions. 


III.  Proposed  Solution 

Our  solution  to  the  patrolling  problem  consists  of  two  steps.  First,  we  partition  the  set  of  nodes  of  interest 
(corresponding  to  a  city  for  example)  into  subsets  of  nodes  called  sectors.  Each  sector  is  assigned  to  one  patrol 
unit.  Second,  for  each  sector,  we  exploit  a  response  strategy  of  preemptive  call-for-service  response,  and  design 
multiple  off-line  patrol  routes.  The  patrol  unit  randomly  selects  predefined  routes  to  conduct  preventive  patrolling; 
whenever  a  call-for-service  request  is  received,  the  patrol  unit  would  stop  the  current  patrol  and  respond  to  the  request 
immediately;  after  completing  the  call-for-service,  the  patrol  unit  would  resume  the  suspended  patrol  route.  The  net 
effect  of  randomized  patrol  routes  with  immediate  call-for-service  response  would  allow  limited  patrol  resources 
to  provide  prompt  response  to  random  requests,  while  effectively  covering  the  nodes  of  different  priorities  having 
varying  incidence  rates. 

The sector partitioning sub-problem is formulated as a combinatorial optimization problem, and solved via the
political districting algorithms presented in [5]. The off-line route planning sub-problem for each sector is formulated
as an infinite-horizon Markov Decision Process (MDP) [4], based on which a novel learning method, viz., Similar
State Estimate Update, is applied. Furthermore, we apply the softmax action selection method [8] to prescribe multiple
patrol routes to create the impression of virtual patrol presence and unpredictability.

A.  Area  partitioning  for  patrol  unit  assignment 

The  problem  of  partitioning  a  patrol  area  can  be  formulated  as  follows: 

• A region is composed of a finite set of nodes of interest: $K = \{i;\ i = 1, \ldots, I\}$. Each node $i \in K$ is centered
at position $(x_i, y_i)$ and has value $p_i = \lambda_i \delta_i$;

• There are $r$ areas to cover the region, such that all nodes are covered with minimum overlap, the sum
of values for each area is similar, and the areas are compact.

This is a typical political districting problem. Dividing a region, such as a state, into small areas, termed districts,
to elect political representatives is called political districting [6]. A region consists of $I$ population units such as
counties (or census tracts), and the population units must be grouped together to form $r$ districts. Due to court
rulings and regulations, the deviation of the population per district cannot exceed a certain proportion of the average
population. In addition, each district must be contiguous and compact. A district is contiguous if it is possible
to travel between any two places in the district without crossing another district. Compactness essentially means that the
district is roughly circular or square in shape rather than a long and thin strip. Such shapes reduce the distance
of the population units to the center of the district or between two population centers of a district. This problem
was extensively studied in [5], [6].

B.  Optimal  Routing  in  a  Sector 

1) MDP modeling: In a sector, there are $n$ nodes of interest, $N = \{1, \ldots, n\} \subset K$. A Markov Decision Process
(MDP) representation of the patrolling problem is as follows:

• Decision epochs are discretized such that each decision epoch begins at the time instant when the patrol unit
finishes checking on a node and needs to move to a next node; the epoch ends at the time instant when the
patrol unit reaches the next node and clears all incidents at that node.

• States {s}: a state, defined at the beginning of decision epoch $t$, is denoted as $s = (i, \mathbf{w})$, where $i \in N$ is
the node the patrol unit is currently located at, and $\mathbf{w} = (w_1, \ldots, w_n)$ denotes the times elapsed since the nodes
were last visited;

• Actions {a}: an action, also defined at the beginning of decision epoch $t$, is denoted as $a = (i, j)$, where $i$ is
the patrol unit's current location, and $j \in adj(i)$, an adjacent node of $i$, denotes the next node to be visited;

• State transition probabilities $P(s' \mid s, a)$: given state $s$ and action $a$, the probability of $s'$ being the next state;

• Reward $g(s, a, s')$: the reward for taking action $a = (i, j)$ at state $s = (i, \mathbf{w})$ to reach next state $s' = (j, \mathbf{w}')$.
At time $t'$, the patrol unit reaches node $j$ and clears $n_j(t')$ incidents, and earns the reward at time $t'$ of
$g(s, a, s') = \delta_j n_j(t')$.

• Discount mechanism: the reward $g$ potentially earned at future time $t'$ is valued as $g e^{-\beta (t' - t)}$ at current time
$t$, where $\beta$ is the discount rate;

• Objective: to determine an optimal policy, i.e., a mapping from states to actions, such that the overall expected
reward is maximized.

The value function (expected reward) of a state $s$ at time $t_0$ for policy $\Pi$ (a mapping from states to actions) is
defined as:

$$V^{\Pi}(s) = E\left[\sum_{k=0}^{\infty} e^{-\beta (t_{k+1} - t_0)} \, g_{k+1}\right] \qquad (1)$$

where $g_{k+1}$ is the reward earned at time $t_{k+1}$. Note that $V^{\Pi}(s)$ is independent of time $t$, i.e., it is a constant
state-dependent stationary value corresponding to a stationary policy.

Dynamic Programming [4], [7] and Reinforcement Learning [8] can be employed to solve the MDP problem.
In this work, we first prove that under any deterministic policy $\Pi$, the value function of a state
$s = (i, \mathbf{w})$ is a linear function: $V^{\Pi}(s = (i, \mathbf{w})) = (c_i^{\Pi}(s))^T \mathbf{w} + d_i^{\Pi}(s)$. Therefore, the optimal policy satisfies
$V^*(s = (i, \mathbf{w})) = (c_i^*(s))^T \mathbf{w} + d_i^*(s)$. Here, $c_i^{\Pi}(s)$ and $d_i^{\Pi}(s)$ denote the parameters for policy $\Pi$, while $c_i^*(s)$ and
$d_i^*(s)$ are the concomitant parameters for the optimal policy $\Pi^*$. Based on this structure, we construct a linear
function as an approximation of the optimal value function, denoted as $\tilde{V}^*(s = (i, \mathbf{w})) = (c_i^*)^T \mathbf{w} + d_i^*$, where $c_i^*$ and $d_i^*$
are constants independent of $\mathbf{w}$. This special structure of the value function enables us to design a novel learning
algorithm, the so-called Similar State Estimate Update (SSEU), to obtain a deterministic near-optimal policy, from
which a near-optimal patrolling route can be obtained. The SSEU algorithm employs ideas from Monte-Carlo
and Temporal Difference (specifically, TD(0)) methods [8], while overcoming the inefficiencies of these methods
on the patrolling problem.

At state $s = (i, \mathbf{w})$, when action $a = (i, j)$ is undertaken, the state transitions to $s' = (j, \mathbf{w}')$. Note that
under our modeling assumption, the state transition by action $a$ is deterministic, while the reward accrued by action
$a$ at state $s$ is stochastic in the sense that the number of incidents at node $j$ is random. Therefore, Bellman's
equation for the patrolling problem can be simplified as:

$$V^*(s) = \max_{a} E\left[e^{-\beta \Delta t} g(s, a, s') + e^{-\beta \Delta t} V^*(s') \mid s, a\right] \qquad (2)$$

$$\phantom{V^*(s)} = \max_{a} \alpha(s, s') \{E[g(s, a, s')] + V^*(s')\}$$

Here $\Delta t$ is the travel time from node $i$ to node $j$ (the edge length divided by $v$), and $g(s, a, s')$ is the reward
for taking action $a = (i, j)$ at state $s = (i, \mathbf{w})$ to reach state $s' = (j, \mathbf{w}')$. The
expected reward is $E[g(s, a, s')] = \delta_j \lambda_j [w_j + \Delta t]$, and $\alpha(s, s') = e^{-\beta \Delta t}$ is the discount factor for the state
transition from $s$ to $s'$.
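The deterministic transition, the expected reward, and the discount factor described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: nodes are indexed from 0, a state is a `(node, elapsed_times)` pair, and the function names (`step`, `expected_reward`, `discount`, `q_value`) are hypothetical, not taken from the paper.

```python
import math


def step(state, j, travel_time):
    """Deterministic transition: move to adjacent node j in `travel_time`."""
    _, w = state
    w_next = [wk + travel_time for wk in w]  # every node ages by the travel time
    w_next[j] = 0.0                          # node j is cleared on arrival
    return (j, w_next)


def expected_reward(state, j, travel_time, lam, delta):
    """E[g] = delta_j * lambda_j * (w_j + travel time)."""
    _, w = state
    return delta[j] * lam[j] * (w[j] + travel_time)


def discount(travel_time, beta):
    """alpha(s, s') = exp(-beta * travel time)."""
    return math.exp(-beta * travel_time)


def q_value(state, j, travel_time, lam, delta, beta, value_fn):
    """Right-hand side of the Bellman equation for one candidate action."""
    next_state = step(state, j, travel_time)
    reward = expected_reward(state, j, travel_time, lam, delta)
    return discount(travel_time, beta) * (reward + value_fn(next_state))


# Hypothetical three-node example; the unit is at node 0.
lam = [2.0, 3.0, 2.0]
delta = [2.0, 4.0, 2.0]
state = (0, [0.0, 3.0, 1.0])
value_of_moving_to_1 = q_value(state, 1, 2.0, lam, delta, 0.1, lambda s: 0.0)
```

Maximizing `q_value` over the adjacent nodes of the current node gives one Bellman backup; the `value_fn` argument is where a learned approximation of the value function would be plugged in.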

The  greatest  challenge  in  using  MDPs  as  the  basis  for  decision  making  lies  in  discovering  computationally 
feasible  methods  for  the  construction  of  optimal,  approximately  optimal  or  satisfactory  policies [7].  Arbitrary  MDP 
problems  are  intractable;  producing  even  satisfactory  or  approximately  optimal  policies  is  generally  infeasible. 
However,  many  realistic  application  domains  exhibit  considerable  structure  and  this  structure  can  be  exploited  to 
obtain  efficient  solutions.  Our  patrolling  problem  falls  into  this  category. 

Theorem 1: For any deterministic policy in the patrolling problem, i.e., $\Pi: s \to a$, $\forall s \in S$, $\forall a \in A(s)$, the
state value function has the following property:

$$V^{\Pi}(s = (i, \mathbf{w})) = (c_i^{\Pi}(s))^T \mathbf{w} + d_i^{\Pi}(s) \quad \forall i \in N \qquad (3)$$

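Proof: Under any deterministic policy $\Pi$, for an arbitrary state $s = (i, \mathbf{w})$ at $t$, the follow-on state trajectory is
deterministic, as the state transition is deterministic in the patrolling problem. We denote the state trajectory in the
format "node(time, reward)" as:

$$i_0 (= i)(t, 0) \to i_1(t + T_1, r_1) \to \cdots \to i_N(t + T_N, r_N) \to \cdots$$

Thus, the value function of state $s$ under policy $\Pi$ is

$$V^{\Pi}(s = (i, \mathbf{w})) = E\left[\sum_{k} r_k e^{-\beta T_k}\right] = \sum_{j} f_{ij}$$

where $r_k$ is the reward earned at decision epoch $t_k$ and $f_{ij}$ signifies the expected sum of discounted rewards earned at node $j$.
Since the sequence of visits to node $j$ is:

$$j(t + T_{j,1}, r_{j,1}), \quad j(t + T_{j,2}, r_{j,2}), \quad \ldots, \quad j(t + T_{j,N}, r_{j,N}), \quad \ldots$$

the expected discounted reward of the first visit to node $j$ following state $s$ is $E(r_{j,1} e^{-\beta T_{j,1}}) = \delta_j \lambda_j (w_j + T_{j,1}) e^{-\beta T_{j,1}}$, and that of the $k$-th ($k > 1$)
visit to node $j$ is $E(r_{j,k} e^{-\beta T_{j,k}}) = \delta_j \lambda_j (T_{j,k} - T_{j,k-1}) e^{-\beta T_{j,k}}$. Therefore, we have

$$f_{ij} = \delta_j \lambda_j \left[(w_j + T_{j,1}) e^{-\beta T_{j,1}} + (T_{j,2} - T_{j,1}) e^{-\beta T_{j,2}} + \cdots + (T_{j,N} - T_{j,N-1}) e^{-\beta T_{j,N}} + \cdots\right] \qquad (7)$$

$$\phantom{f_{ij}} = c_{ij} w_j + d_{ij}.$$

Here, $c_{ij} = \delta_j \lambda_j e^{-\beta T_{j,1}}$ and $d_{ij} = \sum_{k=1}^{\infty} \delta_j \lambda_j [T_{j,k} - T_{j,k-1}] e^{-\beta T_{j,k}}$, with $T_{j,0} = 0$. Since the $T_{j,k}$ ($j = 1, \ldots, n$; $k = 1, \ldots, \infty$) are
dependent on policy $\Pi$ and state $s$, we have $V^{\Pi}(s = (i, \mathbf{w})) = (c_i^{\Pi}(s))^T \mathbf{w} + d_i^{\Pi}(s)$. ■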

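Based on this observation, we employ a linear function approximation for $V^*(s)$ as follows:

$$V^*(s = (i, \mathbf{w})) \approx \tilde{V}^*(s = (i, \mathbf{w})) = (c_i^*)^T \mathbf{w} + d_i^*, \quad \forall i \in N \qquad (8)$$

where $c_i^* = \{c_{ij}^*\}$, and $c_{ij}^*$ is the expected value of $\delta_j \lambda_j e^{-\beta T_{j,1}}$, $j = 1, \ldots, n$, under the optimal policy $\Pi^*$; $d_i^*$ is the
expected value of $\sum_{j=1}^{n} \sum_{k=1}^{\infty} \delta_j \lambda_j [T_{j,k} - T_{j,k-1}] e^{-\beta T_{j,k}}$ under the optimal policy $\Pi^*$.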

Starting from an arbitrary policy, we could employ the following value and policy iteration method [8] to evaluate
and improve the policies iteratively, gradually approaching an optimal policy:

$$V^{t+1}(s) = \max_{\forall a = (i,j),\ j \in adj(i)} \alpha(s, s') \{E[g(s, a = (i, j), s')] + V^{t}(s')\}. \qquad (9)$$

$$\Pi^{t+1}(s) = \arg\max_{\forall a = (i,j),\ j \in adj(i)} \alpha(s, s') \{E[g(s, a = (i, j), s')] + V^{t}(s')\}. \qquad (10)$$

2) Similar State Estimate Update Method (Learning Algorithm): We seek to obtain estimates $r^*$ of the parameters of the optimal
policy, where $r^* = (c, d)^*$, by minimizing the mean-squared error:

$$\min_{r} MSE(r) = \min_{r} \sum_{s \in S} (V^*(s) - V(s, r))^2, \qquad (11)$$

where $V^*(s)$ is the true value at state $s$ under the optimal policy, and $V(s, r)$ is the linear approximation as defined in
Eq. (8).

At iteration step $t$, we observe a new example $s_t \mapsto V^*(s_t)$. Stochastic gradient-descent methods adjust the
parameter vector by a small amount in the direction that would most reduce the error on that example:

$$r^{t+1} = r^{t} + \gamma_t [V^*(s_t) - V(s_t, r^t)] \nabla V(s_t, r^t) \qquad (12)$$

Here $\nabla$ is the gradient operator with respect to $r^t$, and $\gamma_t$ is a positive step-size parameter. Stochastic
approximation theory [3] requires that $\sum_{k=1}^{\infty} \gamma_k = \infty$ and $\sum_{k=1}^{\infty} \gamma_k^2 < \infty$.

There  are  two  classes  of  simulation-based  learning  methods  to  obtain  r*,  viz.,  Monte-Carlo  and  Temporal- 
Difference  learning  methods [8].  These  methods  require  only  experience  -  samples  of  sequences  of  states,  actions, 
and  rewards  from  on-line  or  simulated  interaction  with  environment.  Learning  from  simulated  experience  is  powerful 
in  that  it  requires  no  a  priori  knowledge  of  the  environment’s  dynamics,  and  yet  can  still  attain  optimal  behavior. 
Monte-Carlo methods are ways of solving the reinforcement learning problem based on averaging the sample
returns. In Monte-Carlo methods, experiences are divided into episodes, and it is only upon the completion of an
episode that value estimates and policies are changed. Monte-Carlo methods are thus incremental in an
episode-by-episode sense. In contrast, Temporal Difference methods update estimates based in part on other learned estimates,
without waiting for a final outcome [8].

The Monte-Carlo method, as applied to the patrolling problem, works as follows: based on the current estimate $r^t$, run
one pseudo-episode (a sufficiently long state trajectory); gather the observations of rewards of all states along the
trajectory; apply the stochastic gradient descent method as in Eq. (12) to obtain $r^{t+1}$. Then, repeat the process until
converged estimates ($r^*$) are obtained. A disadvantage of the Monte-Carlo method here is that, for an infinite-horizon MDP, to
make the return accurate for each state, the episode has to be sufficiently long; this would result in a large
memory requirement and a long learning cycle.

The Temporal Difference, TD(0), method, as applied to the patrolling problem, works as follows: simulate one state
transition with $r^t$; then immediately update the estimates to $r^{t+1}$. Define $d_t$ as the return difference due to the transition
from state $s$ to $s'$:

$$d_t = \alpha(s, s') [g(s, a, s') + V(s', r^t)] - V(s, r^t) \qquad (13)$$

where $\alpha(s, s')$ is the discount factor for the state transition from $s$ to $s'$. The TD(0) learning method updates the estimates
according to the formula

$$r^{t+1} = r^{t} + \gamma_t d_t \nabla V(s, r^t) \qquad (14)$$

A disadvantage of TD(0) as applied to the patrolling problem is the following. Since adjacent states are always
from different nodes, $r_j$ ($r_j = (c_j, d_j)$) is used to update $r_i$ ($i \neq j$); this could result in slow convergence or
even divergence.
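A minimal sketch of one TD(0) step with the per-node linear approximation may help to see why parameters of one node are updated from estimates of another. The data layout (a list `c` of coefficient vectors and a list `d` of offsets, one per node) and the function names are assumptions made for this illustration; for a linear $V$, the gradient with respect to $(c_i, d_i)$ is simply $(\mathbf{w}, 1)$.

```python
def linear_value(state, c, d):
    """V(s, r) = c_i . w + d_i for state s = (i, w)."""
    i, w = state
    return sum(cij * wj for cij, wj in zip(c[i], w)) + d[i]


def td0_update(state, next_state, reward, alpha, c, d, step_size):
    """Apply one TD(0) step in place and return the return difference d_t.

    d_t = alpha * (reward + V(s')) - V(s); only the parameters of the
    current node i are moved, but V(s') is read from node j's parameters.
    """
    i, w = state
    td_error = alpha * (reward + linear_value(next_state, c, d)) \
        - linear_value(state, c, d)
    c[i] = [cij + step_size * td_error * wj for cij, wj in zip(c[i], w)]
    d[i] += step_size * td_error
    return td_error


# Hypothetical two-node example with all parameters initialised to zero.
c = [[0.0, 0.0], [0.0, 0.0]]
d = [0.0, 0.0]
error = td0_update((0, [1.0, 2.0]), (1, [2.0, 0.0]), 10.0, 0.5, c, d, 0.1)
```

Here the target for node 0 depends on the current estimate at node 1, which is the coupling the text identifies as a possible source of slow convergence.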

To overcome the disadvantages of the Monte-Carlo and TD(0) methods, while exploiting their strengths in value
learning, we design a new learning method, termed the Similar State Estimate Update (SSEU). We define states
where the patrol unit is located at the same node as being similar, e.g., $s_1 = (i, \mathbf{w}_1)$ and $s_2 = (i, \mathbf{w}_2)$ are similar
states. Suppose that the generated trajectory under the current estimates ($c^t$ and $d^t$) between two adjacent similar states of
node $i$, i.e., state $s = (i, \mathbf{w})$ and $s' = (i, \mathbf{w}')$, is: $i_0 (= i)(t_0, 0),\ i_1(t_1, g_1),\ i_2(t_2, g_2),\ \ldots,\ i_N (= i)(t_N, g_N)$. Based
on this sub-trajectory, we obtain the new observations of $c_{ij}$ for nodes $j = i_1, i_2, \ldots, i_N$ as follows:

$$c_{ij}^{new} = \ldots \qquad (15)$$

and the new observation of $d_i$:

$$d_i^{new} = \ldots \qquad (16)$$


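Consequently, the parameters $c_{ij}$ and $d_i$ are updated by:

$$c_{ij} = c_{ij} + \frac{c_{ij}^{new} - c_{ij}}{N^{c}_{ij}}, \qquad d_i = d_i + \frac{d_i^{new} - d_i}{N^{d}_{i}} \qquad (17)$$

where $N^{c}_{ij}$ is the number of updates of $c_{ij}$, and $N^{d}_{i}$ is the number of updates of $d_i$.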
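If the parameter update is read as a running average over the observations collected so far (an assumption made here, since the update divides the correction by the number of updates), it can be sketched as follows. The function name and the convention that the count includes the current update are hypothetical.

```python
def running_average_update(old, new_observation, count):
    """x <- x + (x_new - x) / N, with N counting updates including this one."""
    return old + (new_observation - old) / count


# Feeding three hypothetical observations of one parameter.
estimate = 0.0
n_updates = 0
for observation in [2.0, 4.0, 6.0]:
    n_updates += 1
    estimate = running_average_update(estimate, observation, n_updates)
```

With this convention the estimate after N updates equals the sample mean of the N observations, so no history of past observations needs to be stored.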

To make our learning algorithm effective, there are two other issues to consider. First, to avoid the possibility that
some nodes are much less frequently visited than others, we apply the exploring-starts rule, where we intentionally
begin episodes from those nodes that are less frequently visited based on the simulation histories. Second, to escape
from local minima, we employ the ε-greedy method. The simplest action selection rule is to select the action with the
highest estimated action value, as in Eq. (10). This method always exploits current knowledge to maximize immediate
reward, and it spends no time at all sampling apparently inferior actions to verify whether they might be profitable
in the long term. In contrast, ε-greedy behaves greedily most of the time, but every once in a while, with a small
probability ε, selects an action at random, uniformly, and independently of the action-value estimates. In ε-greedy,
as in Eq. (18), all non-greedy actions are given the minimal probability of selection, $\epsilon / |A(s)|$, and the remaining bulk
of the probability, $1 - \epsilon + \epsilon / |A(s)|$, is given to the greedy action [8], where $|A(s)|$ is the cardinality of the action set $A(s)$
in state $s$. This enables the learning method to get out of local minima, and thus provides a balance between
exploitation and exploration.
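The selection probabilities of the ε-greedy rule can be written out directly. The sketch below is illustrative; the function names and the use of a plain list of action values are assumptions, and ties are broken in favour of the first maximal action.

```python
import random


def epsilon_greedy_probs(action_values, epsilon):
    """Selection probability of each action under the epsilon-greedy rule."""
    n = len(action_values)
    greedy = max(range(n), key=lambda a: action_values[a])
    probs = [epsilon / n] * n      # minimal probability for every action
    probs[greedy] += 1.0 - epsilon  # remaining bulk goes to the greedy action
    return probs


def epsilon_greedy_choice(action_values, epsilon, rng=random):
    """Sample an action index: uniform w.p. epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


# Hypothetical action values for three adjacent nodes.
probs = epsilon_greedy_probs([1.0, 5.0, 2.0], 0.3)
```

Because the uniform draw may also land on the greedy action, its total probability is $1 - \epsilon + \epsilon / |A(s)|$ rather than $1 - \epsilon$.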

The details of the Similar State Estimate Update learning algorithm can be found in Fig. 1. The $c^*$ and $d^*$ obtained by this
method can provide a near-optimal patrol route by concatenating greedy actions for each state, as described in
Eq. (10).


C.  Strategy  for  Generating  Multiple  Patrolling  Routes 

In this section, we design a method for generating multiple satisfactory routes via the softmax action selection strategy.
In order to impart virtual presence and unpredictability to patrolling, the unit needs multiple and randomized patrol
routes. We employ the softmax action selection method [8], where the greedy action is still given the highest selection
probability, but all the others are ranked and weighted according to their value estimates. The most common softmax
method uses a Gibbs distribution. It chooses action $a$ at state $s$ with probability:

$$\frac{e^{[Q^*(s,a) - Q^*]/\tau}}{\sum_{a' \in A(s)} e^{[Q^*(s,a') - Q^*]/\tau}}, \quad \text{where } Q^* = \max_{a} Q^*(s, a); \qquad (19)$$

where $A(s)$ denotes the set of feasible actions at state $s$, and $Q^*(s, a)$ is the action-value function for the optimal policy
$\Pi^*$,

$$Q^*(s, a) = \alpha(s, s') \{E[g(s, a, s')] + V^*(s')\}. \qquad (20)$$
Learning Algorithm: Similar State Estimate Update
(With Exploring-Starts and ε-greedy rules)

Initialize:
    c = 0, d = 0, Frequencies = 0

Repeat

  - Step 0 (Episode Initialization): beginning with an empty episode p, pick a node
    $i_0 = \arg\min$ Frequencies and initialize $\mathbf{w}_0 = 0$; append the state $s = (i_0, \mathbf{w}_0)$ to p.
    Set t = 0;
    Frequencies($i_0$)++;

  - Step 1 (Parameters Update): Get the last state of the episode, i.e., $s' = (i, \mathbf{w}')$; find
    the latest similar state of $s'$ in p, i.e., $s = (i, \mathbf{w})$; if there is no such state, go to step 2;
    else obtain the sub-trajectory beginning at $s$ and ending at $s'$, update $c_{ij}$ and $d_i$
    as in Eq. (17), then go to step 2;

  - Step 2 (Policy Improvement): Decide the action for state s:

        $j = \arg\max_{k \in adj(i)} \alpha(s, s') \{E[g(s, a = (i, k), s')] + V^t(s')\}$    w.p. $1 - \epsilon$,
        $j = rand(adj(i))$                                                              w.p. $\epsilon$;     (18)

    set $\Delta t$ = (length of edge $(i, j)$) / $v$;
    calculate $\mathbf{w}' = \mathbf{w} + \Delta t$, $w'_j = 0$;
    update $t = t + \Delta t$;
    append state $s' = (j, \mathbf{w}')$ to episode p;
    Frequencies(j)++;
    if p is sufficiently long, go to step 0; else go to step 1.

until c and d converge.

Fig. 1. Similar State Estimate Update (Learning Algorithm)

Here, $\tau$ is a positive parameter called temperature. High temperatures cause the actions to be nearly equiprobable.
Low temperatures cause a greater difference in selection probability for actions that differ in their value estimates.
In the limit as $\tau \to 0$, softmax action selection reverts to greedy action selection.
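The effect of the temperature can be checked numerically with a small sketch of Gibbs (softmax) action selection. The function names and the sample action values are assumptions for illustration; subtracting the maximum action value before exponentiating leaves the probabilities unchanged and avoids overflow at low temperatures.

```python
import math
import random


def softmax_probs(q_values, tau):
    """Gibbs selection probabilities for the given action values."""
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(q_values, tau, rng=random):
    """Draw one action index according to the softmax probabilities."""
    probs = softmax_probs(q_values, tau)
    return rng.choices(range(len(probs)), weights=probs)[0]


# Hypothetical action values for three adjacent nodes.
q = [1.0, 2.0, 3.0]
hot = softmax_probs(q, 1e6)    # nearly equiprobable
cold = softmax_probs(q, 0.01)  # nearly greedy
```

Sampling successive actions this way, instead of always taking the greedy one, is what produces several distinct but still high-value routes from the same learned estimates.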

IV.  Simulation  and  Results 

We illustrate our approach to patrol routing using a simple example that represents a small county, as in Fig. 2.
The nodes, incident rates ($\lambda_i$) and importance indices ($\delta_i$) are given in Table I.

TABLE I
Example Description

node   $\lambda_i$   $\delta_i$    node   $\lambda_i$   $\delta_i$    node   $\lambda_i$   $\delta_i$    node   $\lambda_i$   $\delta_i$
N1     2     2      N12    2     2      N23    2     2      N34    8     4
N2     2     2      N13    2     2      N24    2     2      N35    2     2
N3     2     2      N14    2     2      N25    2     2      N36    2     1
N4     2     2      N15    2     2      N26    2     4      N37    6     4
N5     2     2      N16    2     2      N27    1     2      N38    2     1
N6     2     2      N17    3     4      N28    2     2      N39    2     2
N7     2     2      N18    2     2      N29    2     2      N40    4     6
N8     4     2      N19    1     2      N30    2     2      N41    2     2
N9     2     2      N20    4     10     N31    2     2      N42    2     2
N10    1     2      N21    2     1      N32    4     2      N43    2     2
N11    1     2      N22    1     2      N33    2     2

Velocity of patrol ($v$): 1 unit distance / unit time
Discount rate ($\beta$): 0.1 / unit time

The results for patrolling strategies from the Similar State Estimate Update (SSEU) method and the one-step greedy
strategy are compared in Table II. In the one-step greedy strategy, at each state, the neighboring node that yields
the best instant reward is chosen as the next node, i.e., $j = \arg\max_{k \in adj(i)} \{\delta_k \lambda_k [w_k + \Delta t]\}$, where $\Delta t$ is the
travel time from node $i$ to node $k$. If this patrol area is covered by one patrol unit, the expected overall reward of the
unit following the route obtained by the SSEU method is 2,330 and the reward per unit distance is 17.4; following the
route from the one-step greedy strategy, the expected overall reward is 1,474 and the expected reward per unit distance
is 6.00. If this patrol area is divided into two sectors, i.e., sector a and sector b, as in Fig. 2, the SSEU method
results in the following rewards: for sector a, the overall expected reward is 1,710 and the expected reward per unit
distance is 19.43; for sector b, the overall expected reward is 1,471 and the expected reward per unit distance is 13.8.
The one-step greedy strategy results in the following rewards: for sector a, the expected overall reward is 1,107 and
the expected reward per unit distance is 10.9; for sector b, the expected overall reward is 1,238 and the expected
reward per unit distance is 8.94. Thus, patrol routes obtained by the SSEU method are markedly more efficient than
those of the short-sighted one-step greedy strategy in this example. In this scenario, the nodes with high incident
rates and importance indices are spread out and sparse. Typically, the SSEU method is effective for general
configurations of patrol area. Another observation from the simulation is that the net reward from sector a and
sector b, i.e., 3,181, with two patrolling units, is 36% better than the net reward (2,330) when there is only one patrol
unit. Furthermore, when a unit patrols a smaller area, higher overall reward per area and higher reward per unit
distance are expected. After applying the softmax action selection method to the near-optimal strategy from the SSEU
method on sector a, we obtained multiple sub-optimal routes for this sector; four of them are listed in Table III.
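The comparison quoted above can be rechecked with a few lines of arithmetic. The figures are the expected rewards reported in the text; the variable names are chosen here for illustration.

```python
# Expected overall rewards reported for the SSEU method.
single_unit = 2330   # whole county, one patrol unit
sector_a = 1710      # sector a, its own patrol unit
sector_b = 1471      # sector b, its own patrol unit

# Net reward with two units, and its gain relative to a single unit.
combined = sector_a + sector_b
relative_gain = combined / single_unit - 1.0

# Expected overall reward of the one-step greedy strategy on the whole county.
greedy_single = 1474
sseu_over_greedy = single_unit / greedy_single
```

The sum of the two sector rewards is 3,181, a gain of roughly 36.5% over the single-unit figure, which the text reports as 36%; on the whole county, the SSEU reward is about 1.58 times the one-step greedy reward.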

TABLE II
Patrolling Routes Under Different Strategies

Strategy: SSEU (whole county)
Expected reward: 2,330; reward/distance: 17.4
Patrol route:
1, 10, 20, 21, 31, 32, 33, 34, 41, 40, 39, 36, 37, 27, 26, 25, 28, 29, 30, 34,
33, 32, 31, 21, 20, 19, 18, 17, 24, 29, 35, 40, 39, 38, 37, 36, 35, 34, 33, 32,
31, 21, 20, 10, 1, 2, 3, 8, 12, 13, 17, 24, 23, 30, 34, 41, 40, 39, 36, 37, 27,
26, 16, 15, 6, 5, 4, 7, 8, 9, 11, 10, 20, 21, 31, 32, 43, 42, 41, 34, 35, 40,
39, 38, 37, 36, 28, 25, 24, 17, 18, 19, 20, 21, 22, 23, 30, 34, 33, 32, 43, 42,
41, 40, 39, 36, 37, 27, 26, 16, 15, 14, 13, 17, 18, 19, 20, 10, (1)

Strategy: One-step greedy (whole county)
Expected reward: 1,474; reward/distance: 6.00
Patrol route:
1, 9, 8, 7, 8, 3, 2, 1, 9, 8, 12, 18, 17, 13, 14, 15, 6, 5, 4, 7, 8, 3, 2, 1, 9, 11,
12, 18, 17, 24, 25, 26, 16, 15, 14, 13, 17, 18, 23, 30, 34, 33, 32, 31, 21, 20,
10, 1, 2, 3, 8, 7, 6, 5, 4, 3, 8, 9, 11, 12, 13, 17, 24, 29, 28, 25, 26, 16, 15, 14,
7, 8, 3, 2, 1, 9, 8, 12, 18, 17, 13, 14, 15, 6, 5, 4, 7, 8, 3, 2, 1, 10, 20, 19, 22,
23, 30, 34, 41, 40, 35, 36, 37, 27, 26, 25, 24, 17, 18, 12, 8, 9, 11, 10, 20, 21,
31, 32, 43, 42, 33, 34, 30, 29, 28, 25, 26, 16, 15, 14, 13, 17, 24, 23, 18, 12, 8,
3, 2, 9, 11, 19, 20, 10, 20, 21, 31, 32, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34,
33, 32, 31, 21, 20, 19, 22, 23, 30, 34, 41, 40, 39, 40, 35, 34, 33, 32, 43, 42,
41, 40, 39, 38, 37, 27, 26, 25, 28, 29, 24, 17, 13, 7, 4, 5, 6, 15, 16, 26,
25, 28, 36, 37, 38, 39, 40, 35, 34, 30, 23, 18, 12, 8, 3, 2, (1)

Strategy: SSEU (sector a)
Expected reward: 1,710; reward/distance: 19.43
Patrol route:
1, 10, 20, 21, 31, 32, 33, 34, 30, 23, 18, 19, 20, 21, 31, 32, 33, 34, 41, 42,
43, 32, 31, 21, 20, 10, 11, 9, 8, 12, 18, 23, 30, 34, 33, 32, 31, 21, 20, 19,
18, 23, 30, 34, 41, 42, 43, 32, 31, 21, 20, 10, 11, 12, 8, 3, 2, 9, 8, 12, 18,
23, 30, 34, 33, 32, 31, 21, 20, 19, 22, 23, 30, 34, 41, 42, 43, 32, 31, 21, 20,
10, 11, 12, 8, 9, (1)

Strategy: One-step greedy (sector a)
Expected reward: 1,107; reward/distance: 11.0
Patrol route:
1, 9, 8, 3, 2, 9, 8, 12, 18, 23, 30, 34, 33, 32, 31, 21, 20, 10, 20, 19, 20, 21,
20, 10, 20, 19, 20, 21, 20, 10, 20, 19, 18, 12, 8, 3, 2, 9, 11, 12, 8, 3, 2, 9,
8, 12, 18, 23, 30, 34, 41, 42, 43, 32, 33, 34, 30, 34, 41, 34, 33, 32, 31, 21,
20, 10, 20, 19, 22, 23, 18, 12, 8, 3, 2, 9, 11, 10, 20, 21, 20, 19, 20, 10, 20,
21, 31, 32, 43, 42, 41, 34, 30, 23, 18, 12, 8, 3, 2, 9, 11, 19, 20, 10, (1)

Strategy: SSEU (sector b)
Expected reward: 1,471; reward/distance: 13.8
Patrol route:
4, 5, 6, 7, 13, 17, 24, 29, 35, 40, 39, 36, 37, 27, 26, 25, 28, 29, 35, 40, 39, 38,
37, 27, 26, 16, 15, 14, 13, 17, 24, 29, 35, 40, 39, 36, 37, 27, 26, 25, 28, 29, 35,
40, 39, 38, 37, 27, 26, 16, 15, 6, 5, 4, 7, 13, 17, 24, 29, 35, 40, 39, 36, 37, 27,
26, 25, 28, 29, 35, 40, 39, 38, 37, 27, 26, 16, 15, 14, 13, 17, 24, 29, 35, 40, 39,
36, 37, 27, 26, 25, 28, 29, 35, 40, 39, 38, 37, 27, 26, 16, 15, 6, 7, (4)

Strategy: One-step greedy (sector b)
Expected reward: 1,238; reward/distance: 8.94
Patrol route:
4, 5, 4, 7, 6, 15, 14, 13, 17, 24, 25, 26, 16, 15, 6, 5, 4, 7, 14, 13, 17, 24, 29,
28, 25, 26, 16, 15, 6, 5, 4, 7, 14, 13, 17, 24, 29, 35, 40, 39, 36, 37, 27, 26, 25,
28, 29, 24, 17, 13, 7, 6, 15, 16, 26, 25, 28, 29, 35, 40, 39, 38, 37, 36, 37, 27,
26, 16, 15, 14, 13, 17, 24, 25, 28, 29, 35, 40, 39, 40, 35, 40, 39, 40, 35, 40,
39, 38, 37, 36, 28, 25, 26, 16, 15, 6, 5, (4)

V.  Summary  and  future  work 

In this paper, we considered the problem of effective patrolling in a dynamic and stochastic environment. The
patrol locations are modeled with different priorities and varying incident rates. We identified a two-step solution
approach. First, we partition the set of nodes of interest into sectors, each of which is assigned to one
patrol unit. Second, for each sector, we adopt a response strategy of preemptive call-for-service response and
design multiple off-line patrol routes. We applied the MDP methodology and designed a novel learning algorithm
to obtain a deterministic optimal patrol route. Furthermore, we applied the softmax action selection method to devise
multiple patrol routes from which the patrol unit chooses at random. Future work includes the following: a) considering
the incident processing time and resource requirement at each node; b) including the patrol unit's resource capabilities
in the patrolling formulation; and c) applying adaptive parameter updates for incident rates and importance rates
at each node.


TABLE III

Multiple Patrolling Routes

Route I
Expected reward: 1,525; reward/distance: 17.05
Patrol route: 1, 10, 20, 21, 22, 23, 30, 34, 33, 32, 43, 42, 33, 34, 41, 42, 33, 34, 30, 23, 22,
19, 20, 10, 11, 12, 8, 9, 2, 3, 8, 12, 11, 19, 20, 21, 31, 32, 43, 42, 41, 34, 33,
32, 31, 21, 22, 23, 18, 19, 20, 10, 11, 12, 8, 9, 2, 3, 8, 12, 18, 23, 30, 34, 41,
42, 33, 34, 30, 23, 18, 19, 22, 21, 20, 19, 18, 23, 30, 34, 33, 32, 31, 21, 20, 10,
11, 9

Route II
Expected reward: 1,831; reward/distance: 18.68
Patrol route: 1, 10, 20, 21, 22, 19, 20, 21, 31, 32, 43, 42, 41, 34, 33, 32, 31, 21, 20, 10, 11,
9, 8, 3, 2, 9, 8, 12, 18, 23, 22, 19, 20, 10, 11, 9, 8, 12, 18, 23, 30, 34, 41, 42,
33, 32, 31, 21, 20, 19, 22, 23, 30, 34, 41, 42, 33, 32, 43, 42, 33, 34, 30, 23, 18,
19, 22, 23, 18, 19, 20, 10, 11, 12, 8, 9, 11, 12, 18, 23, 30, 34, 33, 32, 31, 21,
20, 19, 11, 12, 8, 3, 2, 9, 8, 3, 2, (1)

Route III
Expected reward: 1,300; reward/distance: 15.22
Patrol route: 1, 2, 9, 11, 10, 20, 19, 11, 9, 8, 3, 2, 9, 8, 3, 2, 9, 11, 10, 20, 21, 31, 32, 43,
42, 41, 34, 33, 32, 43, 42, 33, 34, 30, 23, 18, 12, 8, 9, 11, 19, 20, 21, 31, 32,
33, 34, 30, 23, 22, 19, 20, 10, 11, 12, 8, 9, 2, 3, 8, 9, 2, 3, 8, 12, 18, 23,
30, 34, 41, 42, 43, 32, 33, 34, 30, 23, 18, 12, 11, 19, 20, 10, 11, 9, (1)

Route IV
Expected reward: 1,389; reward/distance: 15.20
Patrol route: 1, 2, 9, 8, 12, 18, 23, 22, 21, 20, 19, 18, 12, 8, 3, 2, 9, 8, 12, 18, 23, 30,
34, 41, 42, 33, 32, 31, 21, 20, 10, 11, 9, 8, 12, 18, 23, 22, 19, 20, 10, 11,
12, 18, 23, 30, 34, 33, 42, 43, 32, 31, 21, 20, 10, 11, 12, 8, 3, 2, 9, 8, 12,
18, 19, 20, 10, 11, 9, 8, 12, 18, 19, 22, 23, 30, 34, 41, 42, 33, 32, 31, 21,
20, 10, 11, 9, 8, 3, 2, 9
■ Introduction

■ Stochastic Patrolling Model

■ Our Proposed Solution

■ Simulation Results

■ Summary and Future Work


■  Motivation 


►  Preventive  patrolling  is  a  major  component  of  stability  operations  and 
crime  prevention  in  highly  volatile  environments 

►  Optimal  resource  allocation  and  planning  of  patrol  effort  are  critical 
to  effective  stability  and  crime  prevention  due  to  limited  patrolling 
resources 

■  Model  and  Design  Objective 

► Introduce a model of patrolling problems in which the patrol nodes of
interest have different priorities and varying incident rates

► Design a patrolling strategy such that the net effect of randomized
patrol routes with immediate call-for-service response allows limited
patrol resources to provide prompt response to random requests,
while effectively covering all nodes
■ Consider a finite set of nodes of
interest: $N = \{i;\ i = 1, \ldots, n\}$

■ Each node $i$ has the following
attributes:

► Fixed location: $(x_i, y_i)$

► Incident rate: $\lambda_i$ (incidents/hour)
- assume a Poisson process

► Importance index: $\delta_i$
- indicates the relative importance
of node $i$ in the patrolling area

■ Assume $r$ patrol units,
each with average speed $v$



■ Step 1: Partition the set of nodes of interest into sectors (subsets of
nodes). Each sector is assigned to one patrol unit.

- Sector partitioning sub-problem


■ Step 2: Utilize a response strategy of preemptive call-for-service
response and design multiple off-line patrol routes for each
sector

► Step 2.1: Response strategy

■ Give higher priority to call-for-service requests => stop current patrols
and respond to the requests

■ Resume suspended patrols after call-for-service completion

► Step 2.2: Off-line route planning sub-problem

■  Optimal  routing  in  a  sector  <=  Similar  State  Estimate  Update  (SSEU) 
in  Markov  Decision  Process  framework 

■  Strategy  for  generating  multiple  patrol  routes  <=  randomized 
(“softmax”)  action  selection  method 
Step 1: Sector Partitioning Sub-problem


The  problem  is  formulated  as  a  political 
districting  problem: 

►  Let  the  finite  set  of  nodes  of  interest  form 
a  region 

► Each node in the region is centered at
$(x_i, y_i)$ and has an importance value of

Pi  =  4 

►  Define  r  areas  (commensurate  to  the 
number  of  patrol  units)  over  the  region 
such  that: 


- All nodes are covered with
minimum overlap

- Sums of importance values are
similar across areas

- The areas must be geographically
compact and contiguous

This problem has been extensively studied in combinatorial optimization [Garfinkel1970].


$\delta_i$: importance index of node $i$
$\lambda_i$: incident rate of node $i$


2.2:  Off-line  Route  Planning  Sub-problem 

Markov  Decision  Process  (MDP)  Representation 


► States {s}:

■ A state is denoted by $s = \{i, w\}$

■ $i$ represents the node that has been most recently cleared by a patrol unit (and $i$ is also
the current location of the patrol unit)

■ $w = \{w_j\}_{j=1}^{n}$ denotes the elapsed time of all nodes since their last visits from the patrol unit


► Actions {a}:

■ An action is denoted by $a = (i, j)$

■ $j$ ($\neq i$) is a node adjacent to $i$ and is the next node to be visited


► Reward $g(s, a, s')$:

The reward for taking action $a = (i, j)$ at state $s = \{i, w\}$
to reach the next state $s' = \{j, w'\}$

► Discount mechanism:

■ A reward $g$ potentially earned at time $t'$ is valued as $g e^{-\beta (t' - t)}$ at time $t$, where $\beta$ is
the discount rate

■ Encourages prompt actions

►  Objective: 

Determine  an  optimal  policy,  i.e.,  a  mapping  from  states  to  actions,  that  maximizes 
the  overall  expected  reward 
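
The state and action definitions above can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes nodes are indexed 0..n-1, that travel time equals distance divided by average speed, and that visiting a node resets its elapsed time to zero. The names `PatrolState`, `step`, and `discount` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PatrolState:
    node: int                   # i: node most recently cleared (current location)
    elapsed: Tuple[float, ...]  # w: elapsed[j] = hours since node j was last visited


def step(state: PatrolState, next_node: int,
         dist: Dict[Tuple[int, int], float], speed: float):
    """Apply action a = (i, j); return (next_state, travel_time_in_hours)."""
    travel_time = dist[(state.node, next_node)] / speed
    # Every node ages by the travel time ...
    elapsed = [w + travel_time for w in state.elapsed]
    # ... except the node just visited, whose clock restarts (assumption).
    elapsed[next_node] = 0.0
    return PatrolState(next_node, tuple(elapsed)), travel_time


def discount(beta: float, delay: float) -> float:
    """Value now of one unit of reward earned `delay` hours later."""
    return math.exp(-beta * delay)
```

With `beta = 0` the discount factor is 1 and delay carries no penalty; a larger `beta` favors routes that collect reward sooner, which is the "encourage prompt actions" effect named on the slide.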
2.2: Off-line Route Planning Sub-problem
Linear State Value Structure


■ Arbitrary MDP problems are intractable

■ Fortunately, our patrolling problem exhibits a special structure: linearity

► For any deterministic policy $\Pi$ in the
patrolling problem, the state value
function has the property:

$$V^{\Pi}(s = (i, w)) = (c_i^{\Pi}(s))^T w + d_i^{\Pi}(s), \quad \forall i \in N$$

i.e., it is linear w.r.t. $w$ (elapsed time of nodes since last visits from a patrol unit)

The state value function $V^{\Pi}(s)$ is
the expected reward starting from
state $s$ under policy $\Pi$.


► Thus, a linear approximation of the state value function for the optimal policy is:

$$V^{*}(s = (i, w)) = (c_i^{*})^T w + d_i^{*}$$


► The problem becomes one of finding $c_i^{*}$, $d_i^{*}$, $\forall i \in N$ => determine the
optimal policy
2.2.a: Optimal Routing in a Sector
Similar State Estimate Update Method

Introduce a variant of the Reinforcement Learning (RL) method, the Similar State Estimate
Update (SSEU) method, to learn the optimal parameters $c_i^{*}$ and $d_i^{*}$, $\forall i \in N$

► Reinforcement learning is a simulation-based learning method that
requires only experience, i.e., samples of sequences of states, actions, and
rewards from on-line or simulated interaction with the system environment

► Given an arbitrary policy $\Pi$, the policy iteration method of RL iteratively
improves the policy to gradually approach $\Pi^{*}$ as follows:

$$k^{*} = \arg\max_{a = (i, k),\ k \in adj(i)} \alpha(s, s') \left\{ E[g(s, a = (i, k), s')] + V^{\Pi}(s') \right\}$$

where, for a move from node $i$ to node $j$ (in the maximization, $j = k$):

$\alpha(s, s') = e^{-\beta \, dist(i, j) / v}$: discount from $s$ to $s'$

$g(s, a, s')$: reward for taking action $a$ at
state $s$ and reaching state $s'$

$V^{\Pi}(s') = (c_j^{\Pi})^T w' + d_j^{\Pi}$: state value of $s' = (j, w')$
under $\Pi$

$\beta$: discount rate
$v$: average speed
$a$: action

2.2.a: Optimal Routing in a Sector
Similar State Estimate Update Method - 2


► Generate a trajectory via policy iteration utilizing the current
parameter estimates, $c^t$ and $d^t$, between two adjacent
similar states of node $i$: state $s = \{i, w\}$ and state $s' = \{i, w'\}$


Similar states: same
node location, different
visitation time


[Figure; axis: $w$ (elapsed time of nodes since last visits from a patrol unit)]

$j$ represents a node along the trajectory


► Evaluate new values of $c$ and $d$, where $t_j$ denotes the first time node $j$ is
visited in the trajectory:

$$c_{ij}^{new} = \delta_j \lambda_j e^{-\beta t_j}$$

$$d_i^{new} = [\ldots] + \alpha(s, s') V^{t}(s') - (c_i^{new})^T w$$

$$V^{t}(s') = (c_i^{t})^T w' + d_i^{t}$$


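$\delta_j$: importance index of node $j$
$\lambda_j$: incident rate of node $j$
$\beta$: discount rate


► Thus

$$c_{ij}^{t+1} = c_{ij}^{t} + \frac{c_{ij}^{new} - c_{ij}^{t}}{m_{ij}^{c}}$$

and

$$d_i^{t+1} = d_i^{t} + \frac{d_i^{new} - d_i^{t}}{m_i^{d}}$$

$m_{ij}^{c}$: number of previous updates of $c_{ij}$; $m_i^{d}$: number of previous updates of $d_i$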
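
The parameter update at the end of this slide has the form of a running average: each new evaluation moves the estimate by a step that shrinks with the number of updates. The sketch below illustrates that form for a single scalar parameter. It assumes the counter includes the current update, so that the first update is well defined; the class name `RunningEstimate` is hypothetical.

```python
class RunningEstimate:
    """Running average: value <- value + (new - value) / m."""

    def __init__(self, initial=0.0):
        self.value = initial
        self.m = 0  # number of updates applied so far

    def update(self, new_value):
        # Assumption: m counts the current update as well, so the first
        # update replaces the initial guess and later ones average in.
        self.m += 1
        self.value += (new_value - self.value) / self.m
        return self.value
```

Under this counting convention the estimate after m updates equals the mean of the m evaluated values, so one unusual trajectory cannot swing the parameters far once many updates have accumulated.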
2.2.b: Strategy for Generating
Multiple Patrolling Routes


■  Why  multiple  patrolling  routes? 

►  To  impart  virtual  presence  and  unpredictability  to  patrolling 
=>  the  patrol  unit  randomly  selects  one  of  many  patrol  routes 


■  Softmax:  random  action  selection  method 

►  At  each  state, 

-  The  best  action  is  given  the  highest  selection  probability 

-  The  second  best  action  is  given  lesser  probability 

-  The  third  best  action  is  given  even  less  and  ... 

►  Temperature  -  tunable  parameter  -  decides  probability  differences 
among  the  actions 

-  High  temperatures  =>  virtually  equal  probability 

-  Low  temperatures  =>  greater  difference  in  selection  probabilities 
for  actions  having  different  value  estimates 
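
Softmax selection with a temperature can be sketched as follows. The slide gives no formula, so this assumes the standard Boltzmann form, in which the probability of an action is proportional to exp(value / temperature). The function names are hypothetical.

```python
import math
import random


def softmax_probs(values, temperature):
    """Selection probabilities proportional to exp(value / temperature)."""
    top = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_choice(actions, values, temperature, rng=random):
    """Draw one action at random according to the softmax probabilities."""
    probs = softmax_probs(values, temperature)
    return rng.choices(actions, weights=probs, k=1)[0]
```

For action values such as `[3.0, 2.0, 1.0]`, a temperature of 1000 gives nearly uniform probabilities, while a temperature of 0.01 puts almost all of the probability on the best action, which matches the two regimes listed above.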
■ Results from the Illustrative Patrol Problem


Range          Method   Expected Reward   Reward per Unit Distance
Whole Region   SSEU     2,330             17.4
Whole Region   Greedy   1,474             6.0
Sector a       SSEU     1,710             19.43
Sector a       Greedy   1,455             13.8
Sector b       SSEU     1,471             13.8
Sector b       Greedy   1,107             10.9

■ Reward:

► Number of cleared incidents
► Incident importance
► Latency


Greedy refers to the one-step greedy
strategy, i.e., at each state,
select the neighboring node with
the best instant reward


►  Patrol  routes  obtained  by  the  SSEU  method  are  highly  efficient  compared  to  the 
one-step  greedy  strategy 

►  Net  reward  from  two  patrolling  units  (for  sectors  a  and  b)  is  36%  higher  with  the 
SSEU  method  when  compared  to  that  of  one  patrol  unit  in  the  whole  region 
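
The 36% figure can be checked from the SSEU expected rewards in the results table on this slide. The snippet below only repeats that arithmetic; the variable names are illustrative.

```python
# SSEU expected rewards as reported on the results slide.
sseu_reward = {"whole region": 2330, "sector a": 1710, "sector b": 1471}

two_units = sseu_reward["sector a"] + sseu_reward["sector b"]
gain = two_units / sseu_reward["whole region"] - 1.0

print(f"{two_units} vs {sseu_reward['whole region']}: {gain:.1%} higher")
# prints "3181 vs 2330: 36.5% higher"
```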
■ Present an analytical model of the patrolling problem with varying incident
rates and priorities

■ Propose a solution approach in two steps:

■ Step 1: Solve the sector partitioning sub-problem via the Political
Districting Method => assign each sector to one patrol unit

■ Step 2: Utilize a response strategy of preemptive call-for-service
and define optimal and near-optimal patrol routes for each sector
via SSEU and the "softmax"-based method, respectively

■  Future  work: 

■  Incorporate  incident  processing  time  and  resource  requirements  for 
each  node 

■  Include  patrol  unit’s  resource  capabilities  and  workload  constraints 

■  Introduce  dynamic  rerouting  in  the  presence  of  changes  in  the 
incident  rates  and  node  priorities 
Thank You!


